Confidence intervals for the true classification error conditioned on the estimated error.
Authors
Abstract
Bias and variance for small-sample error estimation are typically posed in terms of statistics for the distributions of the true and estimated errors. On the other hand, a salient practical issue asks, given an error estimate, what can be said about the true error? This question relates to the joint distribution of the true and estimated errors, specifically, the conditional expectation of the true error given the error estimate. A critical issue is that of confidence bounds for the true error given the estimate. We consider the joint distribution of the true error and the estimated error, assuming a random feature-label distribution. From it, we derive the marginal distributions, the conditional expectation of the estimated error given the true error, the conditional expectation of the true error given the estimated error, the conditional variance of the true error given the estimated error, and the 95% upper confidence bound for the true error given the estimated error. Numerous classification and estimation rules are considered across a number of models. Massive simulation is used for continuous models and analytic results are derived for discrete classification. We also consider a breast-cancer study to illustrate how the theory might be applied in practice. Although specific results depend on the classification rule, error-estimation rule, and model, some general trends are seen: (I) if the true error is small (large), then the conditional estimated error is generally high (low)-biased; (II) the conditional expected true error tends to be larger (smaller) than the estimated error for small (large) estimated errors; and (III) the confidence bounds tend to be well above the estimated error for low error estimates, becoming much less so for large estimates.
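The conditional quantities named in the abstract can be approximated by Monte Carlo in a toy setting. The sketch below is only an illustration under assumptions that are not taken from the paper: two Gaussian classes with identity covariance and equal priors, a class separation `delta` drawn uniformly at random (standing in for a random feature-label distribution), a nearest-mean classifier, and resubstitution as the error estimator. The function names `simulate_pair` and `conditional_summary` are hypothetical.

```python
import math
import numpy as np


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate_pair(rng, n_per_class=15, dim=2):
    """Draw one (true error, estimated error) pair.

    Assumed toy model: classes N(mu0, I) and N(mu1, I), equal priors,
    with a randomly drawn separation between the class means.
    """
    delta = rng.uniform(0.5, 3.0)  # random feature-label distribution
    mu0 = np.zeros(dim)
    mu0[0] = -delta / 2.0
    mu1 = -mu0

    x0 = rng.normal(size=(n_per_class, dim)) + mu0
    x1 = rng.normal(size=(n_per_class, dim)) + mu1

    # Nearest-mean rule: predict class 1 when w.x + b > 0.
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    w = m1 - m0
    b = -0.5 * float((m1 + m0) @ w)
    norm_w = float(np.linalg.norm(w))

    # Exact true error of this linear rule under the assumed Gaussian model.
    true_err = (0.5 * normal_cdf((float(w @ mu0) + b) / norm_w)
                + 0.5 * normal_cdf(-(float(w @ mu1) + b) / norm_w))

    # Resubstitution estimate: error rate on the training sample itself.
    wrong0 = int(np.sum(x0 @ w + b > 0))
    wrong1 = int(np.sum(x1 @ w + b <= 0))
    est_err = (wrong0 + wrong1) / (2 * n_per_class)
    return true_err, est_err


def conditional_summary(true_errs, est_errs, min_count=30):
    """For each observed estimate, return the conditional mean, the conditional
    variance, and the empirical 95% upper bound of the true error."""
    summary = {}
    for e in np.unique(est_errs):
        t = true_errs[est_errs == e]
        if t.size >= min_count:  # skip estimates seen too rarely
            summary[float(e)] = (float(t.mean()), float(t.var()),
                                 float(np.quantile(t, 0.95)))
    return summary


rng = np.random.default_rng(0)
pairs = np.array([simulate_pair(rng) for _ in range(5000)])
summary = conditional_summary(pairs[:, 0], pairs[:, 1])

for e, (mean_t, var_t, upper95) in sorted(summary.items()):
    print(f"estimate={e:.3f}  E[true|est]={mean_t:.3f}  "
          f"Var[true|est]={var_t:.5f}  95% upper bound={upper95:.3f}")
```

Grouping by the exact value of the estimate works here because resubstitution on 30 points can only take the values k/30. A continuous-valued estimator would need binning instead, and the numbers produced by this toy model should not be read as reproducing the paper's results.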
Related resources
The reliability of estimated confidence intervals for classification error rates when only a single sample is available
Error estimation accuracy is the salient issue regarding the validity of a classifier model. When samples are small, training-data-based error estimates tend to suffer from inaccuracy and quantification of error estimation accuracy is difficult. Numerous methods have been proposed for estimating confidence intervals for the true error based on the estimated error. This paper surveys proposed me...
A New Formulation for Cost-Sensitive Two Group Support Vector Machine with Multiple Error Rate
Support vector machine (SVM) is a popular classification technique which classifies data using a max-margin separating hyperplane. The normal vector and bias of this hyperplane are determined by solving a quadratic model, which means that SVM training amounts to an optimization problem. Among the extensions of SVM, the cost-sensitive scheme refers to a model with multiple costs which conside...
Estimation in Simple Step-Stress Model for the Marshall-Olkin Generalized Exponential Distribution under Type-I Censoring
This paper considers the simple step-stress model from the Marshall-Olkin generalized exponential distribution when there is a time constraint on the duration of the experiment. The maximum likelihood equations for estimating the parameters are derived, assuming a cumulative exposure model with lifetimes distributed as Marshall-Olkin generalized exponential. The likelihood equations do not lea...
Confidence interval estimation for proportions close to zero and one: a secondary modeling study
Background and Objectives: When computing a confidence interval for a binomial proportion p, one must choose an exact interval that has a coverage probability of at least 1-α for all values of p. In this study, we compared the confidence intervals of Clopper-Pearson, Wald, Wilson, and double ArcSin transformation in terms of maintaining a constant nominal type I error. Methods: Simulations w...
Area specific confidence intervals for a small area mean under the Fay-Herriot model
Small area estimates have received much attention from both private and public sectors due to the growing demand for effective planning of health services, apportioning of government funds and policy and decision making. Surveys are generally designed to give representative estimates at national or district level, but estimates of variables of interest are oft...
Journal title:
- Technology in cancer research & treatment
Volume 5, Issue 6
Pages -
Publication date 2006